Travel Package Purchase Prediction - Visit with us

Solution with Ensemble Techniques

Background and Context

You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

A viable business model is a central concept that helps you to understand the existing ways of doing the business and how to change the ways for the benefit of the tourism sector.

One of the ways to expand the customer base is to introduce a new offering of packages.

Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King. Looking at the data of the last year, we observed that 18% of the customers purchased the packages.

However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

However, this time the company wants to harness the available data of existing and potential customers to make the marketing expenditure more efficient.

As a Data Scientist at the "Visit with us" travel company, you have to analyze the customers' data and information to provide recommendations to the Policy Maker and Marketing Team, and also build a model to predict which potential customers are going to purchase the newly introduced travel package.

Data Dictionary

Customer details:

Customer interaction data:

Objective

To predict which customer is more likely to purchase the newly introduced travel package.

Build ensemble models (Bagging and Boosting) that will help the marketing department identify the potential customers who have a higher probability of purchasing the newly introduced travel package.

Solution Contents

Understand Given Data

Read the given data into a data frame and understand its nature: the given features, the total number of records, and whether the data has any missing values, duplicate rows, or outliers.

Visualize the data and understand the data range and outliers.

Loading necessary libraries for EDA

Load all standard python library packages.

Data Manipulation

Data Visualization

Load data to dataframe

Read the given Excel file Tourism.xlsx and load it into a data frame named data.

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.

observations on data

Check the data types of the columns in the dataset.

checking data types of all columns

observations on data types

Summary of the data

observations on data

Customer details:

Customer interaction data:

Check for Duplicates

Let's check for any duplicate rows.

No duplicate rows, so no action is required.
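A minimal sketch of the duplicate check, assuming the data sits in a pandas DataFrame. The small frame below is made up for illustration and is not the Tourism data. It also shows why the check may be worth repeating without the ID column: a unique ID can hide rows that are otherwise identical.

```python
import pandas as pd

# Hypothetical toy frame standing in for the loaded dataset.
toy = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4],
        "Age": [34, 41, 29, 41],
        "ProdTaken": [0, 1, 0, 1],
    }
)

# duplicated() flags rows that repeat an earlier row across all columns.
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Ignoring the unique ID, the second and fourth rows are identical.
n_duplicates_no_id = int(toy.drop(columns="CustomerID").duplicated().sum())
print(f"Duplicate rows ignoring CustomerID: {n_duplicates_no_id}")
```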

Missing Value Treatment - Data Pre-processing

Let's check for missing values

Let's check which columns have null values, and how many.

observations on data missing

Let's check how many rows have null values.
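The two checks above (missing values per column, then per row) can be sketched as follows. The toy frame is hypothetical; only the column names are borrowed from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps.
toy = pd.DataFrame(
    {
        "Age": [34.0, np.nan, 29.0, np.nan],
        "MonthlyIncome": [20000.0, np.nan, 18000.0, 25000.0],
        "NumberOfTrips": [2.0, np.nan, 3.0, 1.0],
    }
)

# Missing values per column, largest first.
missing_by_column = toy.isnull().sum().sort_values(ascending=False)

# Missing values in each row, then how many rows share each count.
missing_per_row = toy.isnull().sum(axis=1)
rows_by_missing_count = missing_per_row.value_counts().sort_index()

print(missing_by_column)
print(rows_by_missing_count)
```

The second table is what drives a per-row decision: rows with many missing columns can be reviewed separately from rows with a single gap.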

observations on data missing by row

Checking rows in which 3 columns have missing values

Decision on rows with 3 missing values

Decision on rows with 2 missing values

Decision on rows with 1 missing value

Missing Values treatment - with Median Imputation

Let's fix all missing values before proceeding further.

Missing value treatment - Monthly Income

Print the rows with missing values, then decide on a condition and an imputation value.
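One possible form of a conditional median imputation is sketched below. Grouping by Designation is an assumption made for illustration; the condition actually used in this notebook is not shown here. The toy values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real condition columns may differ.
toy = pd.DataFrame(
    {
        "Designation": ["Executive", "Executive", "Executive",
                        "Manager", "Manager", "Manager"],
        "MonthlyIncome": [18000.0, 20000.0, np.nan, 24000.0, np.nan, 28000.0],
    }
)

# Median income within each designation; median() skips NaN values.
group_median = toy.groupby("Designation")["MonthlyIncome"].transform("median")

# Fill each gap with the median of its own group.
toy["MonthlyIncome"] = toy["MonthlyIncome"].fillna(group_median)

# Fallback for groups that had no observed value at all.
toy["MonthlyIncome"] = toy["MonthlyIncome"].fillna(toy["MonthlyIncome"].median())

print(toy)
```

The same pattern would apply to the other numerical columns treated in the following sections, with a grouping column chosen per feature.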

Missing value treatment - Age

Print the rows with missing values, then decide on a condition and an imputation value.

Missing value treatment - Duration Of Pitch

Print the rows with missing values, then decide on a condition and an imputation value.

Missing value treatment - Number Of Trips

Print the rows with missing values, then decide on a condition and an imputation value.

Missing value treatment - Number Of Children Visiting

Print the rows with missing values, then decide on a condition and an imputation value.

Missing value treatment - Preferred Property Star

Print the rows with missing values, then decide on a condition and an imputation value.

Missing value treatment - Number Of Followups

Print the rows with missing values, then decide on a condition and an imputation value.

Drop CustomerID Column

Since CustomerID is just a row number and has no relation to the other features, we can drop this column.

Missing values treatment complete.

All missing values are treated, and we now have complete data.

Exploratory Data Analysis And Data processing

Visualize all features before any data clean up and understand what data needs cleaning and fixing.

Initial Univariate analysis and relation with Target feature

Univariate analysis helps to check data skewness and possible outliers and spread of the data.

Creating a method that plots a univariate chart with a histplot, a boxplot, and a bar chart showing percentages.

Data nature by columns

Boolean Features

Numerical Features

Categorical Features

convert object to category types

All objects converted to category types
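A minimal sketch of the object-to-category conversion on a hypothetical toy frame. The explicit cast to object at construction only keeps the example independent of the pandas version's default string dtype.

```python
import pandas as pd

# Hypothetical toy frame; string columns are stored as object dtype.
toy = pd.DataFrame(
    {
        "Occupation": ["Salaried", "Small Business", "Salaried"],
        "Gender": ["Male", "Female", "Male"],
        "Age": [34, 41, 29],
    }
).astype({"Occupation": "object", "Gender": "object"})

# Pick every object column and convert it to the category dtype.
object_columns = toy.select_dtypes(include="object").columns
for column in object_columns:
    toy[column] = toy[column].astype("category")

print(toy.dtypes)
```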

Checking all Boolean type features

Check how ProdTaken data is distributed

observations on ProdTaken

Check how OwnCar data is distributed

observations on OwnCar

Check how Passport data is distributed

observations on Passport

Checking Numerical Features

Check how Age data is distributed

observations on Age

Check how Number Of Trips data is distributed

observations on NumberOfTrips

Check how Monthly Income data is distributed

observations on MonthlyIncome

Check how Number Of Followups data is distributed

observations on NumberOfFollowups

Check how Duration Of Pitch data is distributed

observations on DurationOfPitch

Checking Categorical Columns

Check how Type of Contact data is distributed

observations on TypeofContact

Check how Occupation data is distributed

observations on Occupation

Check how Gender data is distributed

Let's fix the spelling mistake "Fe Male" to "Female" using replace.
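The fix can be sketched on a hypothetical Gender series as follows.

```python
import pandas as pd

# Hypothetical values, including the misspelled label.
gender = pd.Series(["Male", "Fe Male", "Female", "Fe Male"], name="Gender")

# Map the misspelled label onto the correct one; other values are untouched.
gender_clean = gender.replace("Fe Male", "Female")

counts = gender_clean.value_counts()
print(counts)
```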

observations on Gender

Check how Marital Status data is distributed

observations on Marital Status

Check how Designation data is distributed

observations on Designation

Check how Product Pitched data is distributed

observations on ProductPitched

Check how City Tier data is distributed

observations on City Tier

Check how Number Of Person Visiting data is distributed

observations on Number Of Person Visiting

Check how Preferred Property Star data is distributed

observations on Preferred Property Star

Check how Number Of Children Visiting data is distributed

observations on Number Of Children Visiting

Check how Pitch Satisfaction Score data is distributed

observations on Pitch Satisfaction Score

Feature engineering & Data Cleaning

Feature Engineering - Age to Age Range - Create Age Bins - Replacement for the numerical Age value
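A sketch of the binning step with pandas.cut. The bin edges and labels below are hypothetical; the edges used in this notebook are not shown here.

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([18, 25, 31, 40, 47, 55, 61], name="Age")

# Hypothetical edges; each bin is open on the left and closed on the right,
# so (17, 30] holds ages 18 to 30.
bin_edges = [17, 30, 40, 50, 61]
bin_labels = ["18-30", "31-40", "41-50", "51-61"]

age_range = pd.cut(ages, bins=bin_edges, labels=bin_labels)
print(age_range.value_counts().sort_index())
```

The result is an ordered categorical, which can replace the numerical Age column in later steps.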

observations on AgeRange

Outlier Treatments & Data Transformation - Data Pre-processing

Checking data that falls outside the IQR - 4*IQR range
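One reading of this check is a fence at Q1 - 4*IQR and Q3 + 4*IQR; that interpretation, the helper name iqr_outlier_mask, and the sample values are assumptions made for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, multiplier: float = 4.0) -> pd.Series:
    """Return True where a value lies outside Q1 - m*IQR .. Q3 + m*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return (values < lower) | (values > upper)


# Hypothetical pitch durations with one extreme value.
duration = pd.Series([8, 9, 10, 12, 14, 15, 16, 127], name="DurationOfPitch")

mask = iqr_outlier_mask(duration)
print(duration[mask])
```

A multiplier of 4 is much wider than the usual 1.5, so only extreme values are flagged.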

Observations on Duration Of Pitch Outliers

Checking outliers on income

Observations on income Outliers

Checking outliers on NumberOfTrips

Observations on NumberOfTrips

Numerical Features - Checking Data distributions

Let's check features with log transformation
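A small sketch of how a log transformation can be checked on a right-skewed feature. The income values are invented, and log1p is one common choice rather than necessarily the transform used here.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes.
income = pd.Series(
    [15000, 17000, 20000, 24000, 30000, 40000, 60000, 98000],
    name="MonthlyIncome",
)

# log1p computes log(1 + x), which is also safe for zero values.
income_log = np.log1p(income)

skew_before = income.skew()
skew_after = income_log.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Comparing skewness before and after gives a quick numerical check alongside the distribution plots.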

observations

Bivariate Analysis

Data correlation analysis

Observations on correlations

Pair Plot

Observations on pair plots

Checking Target Variable relation with other features

Observations

Observations

Observations

Observations

Observations

Checking Target Variable relation with Numerical Features

MonthlyIncome vs ProdTaken

Observations

NumberOfTrips vs ProdTaken

Observations

DurationOfPitch vs ProdTaken

Observations

NumberOfFollowups vs ProdTaken

Observations

NumberOfChildrenVisiting vs ProdTaken

Observations

PreferredPropertyStar vs ProdTaken

Observations

NumberOfPersonVisiting vs ProdTaken

Observations

CityTier vs ProdTaken

Observations

Observations

Monthly Income vs NumberOfPersonVisiting with ProdTaken

Observations

Monthly Income vs DurationOfPitch with ProdTaken

Observations

Observations

Monthly Income vs NumberOfFollowups with ProdTaken

Observations

Monthly Income vs PreferredPropertyStar with ProdTaken

Observations

Data Post Cleanup/Outlier Treatment & Feature Engineering

Understand the shape of the dataset.

observations on data

Check the data types of the columns in the dataset.

checking data types of all columns

Summary of the data

Insights based on EDA

Summary Missing Value/Outlier/Feature Engineering Treatments

ProdTaken

OwnCar

Passport

Age - Converted to AgeRange

Number Of Trips

Monthly Income

Number Of Followups

Duration Of Pitch

Type of Contact

Occupation

Gender

Marital Status

Designation

Product Pitched

City Tier

Number Of Person Visiting

Preferred Property Star

Number Of Children Visiting

Pitch Satisfaction Score

Monthly Income vs NumberOfPersonVisiting with ProdTaken

Monthly Income vs DurationOfPitch with ProdTaken

Monthly Income vs NumberOfTrips with ProdTaken

Monthly Income vs NumberOfFollowups with ProdTaken

Monthly Income vs PreferredPropertyStar with ProdTaken

Model Building - Approach

  1. Data preparation
  2. Partition the data into train and test set.
  3. Build model on the train data.
  4. Tune the model if required.
  5. Test the data on test set.

Split Data
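A sketch of a stratified train/test split. The 70/30 ratio, the random seed, and the synthetic data (about 18% positives, echoing the purchase rate quoted in the background) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and the ProdTaken target.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.82, 0.18], random_state=1
)

# stratify=y keeps the class ratio nearly identical in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

print(X_train.shape, X_test.shape)
print(f"positive rate train: {y_train.mean():.3f}, test: {y_test.mean():.3f}")
```

Stratifying matters with an imbalanced target, since a plain random split can leave the test set with too few buyers to measure recall reliably.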

Observation on Data Split

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting a customer will take the tour package, but the customer does not take it (False Positive).
  2. Predicting a customer will not take the tour package, but the customer takes it (False Negative).

Which case is more important?

How can this loss be reduced, i.e. how can False Negatives be reduced?

Let's define a function that provides metric scores (accuracy, recall, and precision) on the train and test sets, and a function that shows the confusion matrix, so that we do not have to repeat the same code while evaluating models.
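A simplified sketch of such a helper is given below. The name metric_scores and the hand-made labels are hypothetical; this version takes true and predicted labels directly, so it would be called once for the train set and once for the test set.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)


def metric_scores(y_true, y_pred):
    """Return accuracy, recall and precision for one set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
    }


# Hand-made labels: 2 true positives, 2 false negatives, 1 false positive.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

scores = metric_scores(y_true, y_pred)
# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(scores)
print(cm)
```

Recall is TP / (TP + FN), so every missed buyer (False Negative) lowers it directly, which is why it is the score to watch here.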

Model - Default Decision Tree Model

Observations on Default Decision Tree Model

Model building - Bagging - Default Classifier

Observations on Default Bagging Classifier

Model - Default Bagging Classifier with weighted decision tree
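A sketch of a bagging classifier built on a class-weighted decision tree. The weights, the number of estimators, and the synthetic data are assumptions, not the values used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared dataset.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.82, 0.18], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Hypothetical weights, roughly inverse to the class frequencies,
# so errors on the minority class cost more.
weighted_tree = DecisionTreeClassifier(
    class_weight={0: 0.18, 1: 0.82}, random_state=1
)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bagging = BaggingClassifier(weighted_tree, n_estimators=50, random_state=1)
bagging.fit(X_train, y_train)

train_recall = recall_score(y_train, bagging.predict(X_train))
test_recall = recall_score(y_test, bagging.predict(X_test))
print(f"recall train: {train_recall:.3f}, test: {test_recall:.3f}")
```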

Observations on Bagging Classifier with weighted decision tree

Model - Random Forest

Observations on Default Random Forest

Random forest with class weights

Model building - Boosting - Default Classifier

AdaBoost Classifier

observation

Gradient Boosting Classifier

observation

XGBoost Classifier

observation

Summary of all default models

Observations on all default models

Performance improvement - Tuning Models with hyperparameters

Model performance improvement - Bagging

Tuning Decision Tree
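A sketch of decision tree tuning with a grid search scored on recall. The parameter grid, the balanced class weights, and the synthetic data are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared dataset.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.82, 0.18], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Hypothetical grid; a real search would usually cover more values.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# scoring="recall" picks the combination with the best cross-validated recall.
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

tuned_tree = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern extends to the bagging, random forest, and boosting models by swapping the estimator and the grid.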

Observations on Tuned Decision Tree

Tuning Decision Tree

Observations on Tuned Decision Tree

Tuning Bagging Classifier

Observations on Tuned Bagging Classifier

Tuning Random Forest

Tuning Random Forest approach 2

Observations on Tuned Random Forest

Summary - bagging models - Comparing all the models

After trying multiple hyperparameter configurations, we ended up with the current configuration, which has good Recall and F1 Score.

Feature importance of Random Forest
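A sketch of how feature importances can be read from a fitted random forest. The feature names and data are synthetic placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names; the real columns come from the prepared dataset.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=1
)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X, y)

# Impurity-based importances sum to 1; sort so the strongest come first.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking.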

Observations on Feature importance

Model performance improvement - Boosting

AdaBoost Classifier

Observations on tuned AdaBoost Classifier

Gradient Boosting Classifier

Observations on tuned Gradient Boost Classifier

XGBoost Classifier

Observations on tuned XGBoost Classifier

Note: Tuning was somewhat difficult because each run takes a long time, so only a few options were tried and the choice was based on the recall score and F1 score.

Stacking Classifier
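A sketch of a stacking classifier. The choice of base models (random forest and gradient boosting) and of logistic regression as the final estimator is an assumption; the models stacked in this notebook are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the prepared dataset.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.82, 0.18], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Hypothetical base models; tuned models would normally go here.
base_models = [
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]

# The final estimator learns from cross-validated base-model predictions.
stack = StackingClassifier(
    estimators=base_models, final_estimator=LogisticRegression(), cv=5
)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
stack_recall = recall_score(y_test, y_pred)
stack_f1 = f1_score(y_test, y_pred)
print(f"recall: {stack_recall:.3f}, f1: {stack_f1:.3f}")
```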

Observations on Stacking Classifier

Summary - boosting models - Comparing all the models

After trying multiple hyperparameter configurations, we ended up with the current configuration, which has good Recall and F1 Score.

Summary - Comparing all the models - To Pick best model

After comparing the Recall and F1 Score of all models, the Stacking model produces the best Recall and F1 Score.

Feature importance - XGBoost Tuned

Business Insights and Recommendations

Key Features

Key Insights - The company should target customers with the characteristics below to improve sales